Unstructured Data Integration through Automata-Driven Information Extraction

Authors

  • Maroun Abi Assaf
  • Kablan Barbar
  • Youakim Badr
  • Mahmoud Rammal
Abstract

Extracting information from plain text and restructuring it into relational databases raises the challenge of how to locate relevant information and update database records accordingly. In this paper, we propose a wrapper that efficiently extracts information from unstructured documents containing plain text written in natural-like language. Our extraction approach uses the automata formalism to describe the wrapping process that runs from text documents to databases. As usual, relevant information in the text document is delimited by regular expressions, which define the extracting automaton. Each automaton is enriched with an output function that automatically generates SQL queries, synchronized with the extraction process, to insert the extracted data into database records. We validate our approach with an automaton-based prototype that extracts legal information about decrees published in the Lebanese official journal and automatically inserts it into a relational database.
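
The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' prototype: the decree wording, the state names, the `decree` table schema, and the helper names `run_automaton` and `load_decrees` are all assumptions made for illustration. Each transition is guarded by a delimiting regular expression, and reaching the accepting state triggers an output function that emits a parameterized SQL INSERT.

```python
import re
import sqlite3

# Hypothetical automaton: state -> (delimiting regex, next state, column to fill).
# The patterns and the decree wording are assumptions, not taken from the paper.
TRANSITIONS = {
    "start": (re.compile(r"Decree No\.\s*(\d+)"), "number", "number"),
    "number": (re.compile(r"dated\s+(\d{4}-\d{2}-\d{2})"), "date", "issued"),
    "date": (re.compile(r"concerning\s+([^.]+)\."), "accept", "subject"),
}

INSERT_SQL = "INSERT INTO decree (number, issued, subject) VALUES (?, ?, ?)"


def run_automaton(text):
    """Scan text left to right and yield one (sql, params) pair per decree.

    The yield plays the role of the output function: it fires when the
    automaton reaches the accepting state, in step with the extraction.
    """
    state, pos, record = "start", 0, {}
    while True:
        pattern, next_state, column = TRANSITIONS[state]
        match = pattern.search(text, pos)
        if match is None:
            return  # no further transition is possible: stop scanning
        record[column] = match.group(1).strip()
        state, pos = next_state, match.end()
        if state == "accept":
            yield INSERT_SQL, (record["number"], record["issued"], record["subject"])
            state, record = "start", {}  # restart for the next decree


def load_decrees(text):
    """Run the automaton over text and return the rows stored in the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE decree (number TEXT, issued TEXT, subject TEXT)")
    for sql, params in run_automaton(text):
        conn.execute(sql, params)
    rows = conn.execute(
        "SELECT number, issued, subject FROM decree ORDER BY number"
    ).fetchall()
    conn.close()
    return rows


# Invented sample input in the assumed wording.
SAMPLE = (
    "Decree No. 112 dated 2016-03-04 concerning the appointment of a director. "
    "Decree No. 113 dated 2016-03-09 concerning customs tariffs."
)

if __name__ == "__main__":
    for row in load_decrees(SAMPLE):
        print(row)
```

Passing the extracted values as query parameters, rather than formatting them into the SQL string, keeps text taken from the document from being interpreted as SQL. A real wrapper would also need to handle records that are incomplete or out of order, which this sketch simply stops on.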


Similar resources

Unstructured information integration through data-driven similarity discovery

Information integration from multiple heterogeneous sources is one of the major challenges facing enterprises and service providers today, and one of the important problems in this domain is the integration of structured and unstructured (or text) data. In this paper we describe our work on a data-driven approach to integrating various sources of text data, without relying on the availability o...


Ontology-driven Information Extraction

Homogeneous unstructured data (HUD) are collections of unstructured documents that share common properties, such as similar layout, common file format, or common domain of values. Building on such properties, it would be desirable to automatically process HUD to access the main information through a semantic layer – typically an ontology – called semantic view. Hence, we propose an ontology-bas...


Methods for Ontology-Driven Integration

This paper describes the motivations, approach, and architecture for using ontologies in knowledge extraction and in applications that assist situated agents in complex information integration tasks. Our approach applies ontologies along with semantic analysis methods to extract task relevant knowledge from distributed, unstructured text sources. This knowledge is then applied to assist in info...


Ontology Driven Web Extraction from Semi-structured and Unstructured Data for B2B Market Analysis

The Market Blended Insight project has the objective of improving the UK business to business marketing performance using the semantic web technologies. In this project, we are implementing an ontology driven web extraction and translation framework to supplement our backend triple store of UK companies, people and geographical information. It deals with both the semi-structured data and the un...


A Mutually Beneficial Integration of Data Mining and Information Extraction

Text mining concerns applying data mining techniques to unstructured text. Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents, transforming unstructured text into a structured database. This paper describes a system called DISCOTEX, that combines IE and data mining methodologies to perform text mining as well as...




Publication date: 2017